The Gapminder World dataset is about people from around the world. It describes their lives using data from trustworthy sources, such as the UN. The raison d'être of this dataset is to reduce the gap between popular misconception and reality.
Certain topics are rife with misconception. One such topic is human progress and development. People underestimate the developing world, especially with regard to sustainable development, and that includes people in developing countries.[1][2] Likewise, people may overestimate the sustainability of developed countries. For instance, Northern Europeans overestimate the role of clean energy in the world today, which is ironic given their proclivity for burning wood and oil for warmth.[3][4] Misconceptions like these could lead to trouble. If Brits were to assume that the clean energy revolution had already occurred, then they might take their foot off the gas, so to speak, on thwarting climate change.
These sorts of misconceptions can be addressed using the Sustainable Development Index (SDI), which includes 164 countries going back three decades. The SDI divides each country's Human Development Index by its "Ecological Impact Index," a metric for excessive pollution or overconsumption of raw materials.
$$ \large{SDI = \dfrac{HDI}{EII}} $$

Thus, the SDI penalizes countries for using up more than their equitable share of resources. But because the baseline $EII$ is 1, the SDI does not reward countries that use less than their allotted resources, either.[5] For these countries, SDI functions just like HDI.
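The ratio above can be sketched in a few lines. This is only an illustration of the formula as described here: the function name, the flooring of EII at 1, and the input values are assumptions, not the index's official computation.

```python
def sdi(hdi: float, eii: float) -> float:
    """SDI as described in the text: HDI divided by the Ecological Impact Index.

    Assumption: EII is floored at its baseline of 1, so staying under the
    allotted share of resources earns no bonus.
    """
    return hdi / max(eii, 1.0)


# Hypothetical values, not taken from the dataset.
within_share = sdi(hdi=0.80, eii=1.0)   # no penalty: SDI equals HDI
overshooting = sdi(hdi=0.90, eii=3.0)   # penalized despite the higher HDI
print(within_share, overshooting)
```

In this toy case the more developed country ends up with the lower score, which is the pattern the rest of the investigation looks for.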
This investigation uses ver. 2 of the SDI data, which includes three tables of supporting data. Two are used in the investigation: income per person (GNIpc) and material footprint per capita (MFpc).
This investigation will set out to determine just how developed a country can get before it starts using up more than its share of resources.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
# The `read_csv()` function includes a number of parameters for selectively
# "reading in" CSV files, which is not only easier on the computer but can give
# the programmer a head start on wrangling the data.
sdi = pd.read_csv('_Sustainable Development Index-Dataset - v2 - Unpivot-SDI'
                  '-for-countries-etc.csv', header=3,
                  usecols=['name', 'Year', 'SDI(%)'])[['name', 'Year', 'SDI(%)']]
sdi.head()
| | name | Year | SDI(%) |
|---|---|---|---|
| 0 | Afghanistan | 1990.0 | 32.5 |
| 1 | Afghanistan | 1991.0 | 33.1 |
| 2 | Afghanistan | 1992.0 | 34.0 |
| 3 | Afghanistan | 1993.0 | 33.5 |
| 4 | Afghanistan | 1994.0 | 33.0 |
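The effect of `header` and `usecols` can be tried without the original file. The sketch below reads a small, made-up CSV from memory; its contents and the country name are placeholders, not part of the SDI dataset.

```python
import io

import pandas as pd

# Hypothetical file contents: two preamble lines, then the real header row.
raw = io.StringIO(
    "Toy dataset\n"
    "Source: made up for illustration\n"
    "name,geo,Year,SDI(%)\n"
    "Examplestan,exa,1990,32.5\n"
    "Examplestan,exa,1991,33.1\n"
)

# header=2 treats the third line as the header and discards what precedes it;
# usecols keeps only the listed columns.
toy = pd.read_csv(raw, header=2, usecols=['name', 'Year', 'SDI(%)'])
print(toy)
```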
sdi.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5790 entries, 0 to 5789
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    4611 non-null   object
 1   Year    4611 non-null   float64
 2   SDI(%)  4611 non-null   float64
dtypes: float64(2), object(1)
memory usage: 135.8+ KB
Each column has 5790 entries, only 4611 of which are non-null.
The year data is of the wrong type, float64.*
*Float64 is normally used to represent fractional values—hence the `.0`'s at the end.
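The float type is most likely a side effect of the missing rows: pandas represents a missing number as NaN, which is a float, so any numeric column containing one is stored as float64 by default. A minimal sketch on toy values:

```python
import numpy as np
import pandas as pd

# Toy column: whole-number years plus one missing entry.
years = pd.Series([1990, 1991, np.nan])
print(years.dtype)            # float64, because NaN is itself a float

# Once the missing rows are gone, the values can be cast to integers.
clean = years.dropna().astype(int)
print(clean.tolist())
```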
gnipc = pd.read_csv('_Sustainable Development Index-Dataset - v2 -'
' data gdpcapcppp@fasttrack year countries_etc.csv',
index_col='name')
gnipc.head()
| name | geo | time | Income per person |
|---|---|---|---|
| Afghanistan | afg | 1800 | 603 |
| Afghanistan | afg | 1801 | 603 |
| Afghanistan | afg | 1802 | 603 |
| Afghanistan | afg | 1803 | 603 |
| Afghanistan | afg | 1804 | 603 |
gnipc.info()
<class 'pandas.core.frame.DataFrame'>
Index: 46995 entries, Afghanistan to Zimbabwe
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   geo                46995 non-null  object
 1   time               46995 non-null  int64
 2   Income per person  46995 non-null  int64
dtypes: int64(2), object(1)
memory usage: 1.4+ MB
The geo-id column seems redundant.
The year data is labeled, somewhat confusingly, as 'time.'
The 'Income per person' column does not specify the unit of measure.
mfpc = pd.read_csv('_Sustainable Development Index-Dataset - v2 -'
' data_matfootp_cap#v1@fasttrack_year_countries_etc.csv',
index_col='name')
mfpc.head()
| name | geo | time | Material footprint per capita (tonnes) |
|---|---|---|---|
| Afghanistan | afg | 1990 | 2.46 |
| Afghanistan | afg | 1991 | 2.81 |
| Afghanistan | afg | 1992 | 2.06 |
| Afghanistan | afg | 1993 | 1.87 |
| Afghanistan | afg | 1994 | 1.60 |
mfpc.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4816 entries, Afghanistan to Zimbabwe
Data columns (total 3 columns):
 #   Column                                  Non-Null Count  Dtype
---  ------                                  --------------  -----
 0   geo                                     4816 non-null   object
 1   time                                    4816 non-null   int64
 2   Material footprint per capita (tonnes)  4816 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 150.5+ KB
The year data is labeled 'time' here as well.
Geo still seems redundant.
'Material footprint per capita (tonnes)' is much too long a name.
sdi.dropna(axis=0, inplace=True)
sdi['Year'] = sdi.Year.astype(int)
sdi.set_index('name', inplace=True)
sdi.head()
| name | Year | SDI(%) |
|---|---|---|
| Afghanistan | 1990 | 32.5 |
| Afghanistan | 1991 | 33.1 |
| Afghanistan | 1992 | 34.0 |
| Afghanistan | 1993 | 33.5 |
| Afghanistan | 1994 | 33.0 |
sdi.isnull().any()
Year      False
SDI(%)    False
dtype: bool
Drop the redundant geo-id column.
Rename mislabeled columns:
gnipc.drop(columns='geo', inplace=True)
gnipc.columns = ['Year', 'GNIpc (PPP$2017)']
gnipc.head()
| name | Year | GNIpc (PPP$2017) |
|---|---|---|
| Afghanistan | 1800 | 603 |
| Afghanistan | 1801 | 603 |
| Afghanistan | 1802 | 603 |
| Afghanistan | 1803 | 603 |
| Afghanistan | 1804 | 603 |
Drop the redundant geo-id column.
Rename mislabeled columns:
mfpc.drop(columns='geo', inplace=True)
mfpc.columns = ['Year', 'MFpc (tons)']
mfpc.head()
| name | Year | MFpc (tons) |
|---|---|---|
| Afghanistan | 1990 | 2.46 |
| Afghanistan | 1991 | 2.81 |
| Afghanistan | 1992 | 2.06 |
| Afghanistan | 1993 | 1.87 |
| Afghanistan | 1994 | 1.60 |
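Assigning a list to `.columns`, as above, relies on the labels being listed in the right order. An alternative worth considering is `rename()` with an explicit mapping. The sketch below uses a toy frame with the same column labels; it illustrates the alternative and is not the approach taken in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({'time': [1990, 1991],
                    'Material footprint per capita (tonnes)': [2.46, 2.81]})

# Each old label is paired with its replacement, so column order cannot
# silently mislabel anything.
toy = toy.rename(columns={'time': 'Year',
                          'Material footprint per capita (tonnes)': 'MFpc (tons)'})
print(toy.columns.tolist())
```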
sdi_latest = sdi.query('Year == 2019')
sdi_latest.sort_values(by='SDI(%)', ascending=False).head()
| name | Year | SDI(%) |
|---|---|---|
| Costa Rica | 2019 | 85.3 |
| Sri Lanka | 2019 | 84.3 |
| Georgia | 2019 | 83.9 |
| Armenia | 2019 | 82.7 |
| Albania | 2019 | 82.6 |
The top countries are Costa Rica, Sri Lanka, Georgia, Armenia and Albania.
sdi_latest.sort_values(by='SDI(%)', ascending=False).tail()
| name | Year | SDI(%) |
|---|---|---|
| United States | 2019 | 18.1 |
| Australia | 2019 | 15.0 |
| United Arab Emirates | 2019 | 11.0 |
| Kuwait | 2019 | 9.9 |
| Singapore | 2019 | 7.9 |
When it comes to sustainable development, Australia and the United States are among the bottom—along with other highly developed, highly industrialized countries.
# As highly developed countries tend strongly to be high income, the GNIpc data
# can be used to suss out the highly developed countries.
gnipc_latest = gnipc.query('Year == 2019')
sdi_latest = pd.merge(sdi_latest, gnipc_latest, how='left', on='name')
sdi_latest.head()
| name | Year_x | SDI(%) | Year_y | GNIpc (PPP$2017) |
|---|---|---|---|---|
| Afghanistan | 2019 | 55.1 | 2019 | 1763 |
| Angola | 2019 | 62.6 | 2019 | 5544 |
| Albania | 2019 | 82.6 | 2019 | 12694 |
| United Arab Emirates | 2019 | 11.0 | 2019 | 65650 |
| Argentina | 2019 | 77.7 | 2019 | 17529 |
sdi_latest.drop(['Year_x', 'Year_y'], axis=1, inplace=True)
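The `Year_x` and `Year_y` columns appear because both tables carry a `Year` column that is not part of the join key. A small sketch with made-up frames (labels `A` and `B` are placeholders) shows the effect, and how selecting only the needed columns beforehand would avoid the clean-up step:

```python
import pandas as pd

left = pd.DataFrame({'Year': [2019, 2019], 'SDI(%)': [55.1, 62.6]},
                    index=pd.Index(['A', 'B'], name='name'))
right = pd.DataFrame({'Year': [2019, 2019], 'GNIpc': [1763, 5544]},
                     index=pd.Index(['A', 'B'], name='name'))

# Both frames carry a 'Year' column, so merge disambiguates with suffixes.
with_dupes = pd.merge(left, right, how='left', on='name')

# Selecting only the needed columns first leaves nothing to disambiguate.
tidy = pd.merge(left[['SDI(%)']], right[['GNIpc']], how='left', on='name')

print(with_dupes.columns.tolist())
print(tidy.columns.tolist())
```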
# There are more countries than can fit on one comparison chart. One solution
# is to systematically select a sample to represent the rest.
sample = sdi_latest.index[1::5]
sample
Index(['Angola', 'Antigua and Barbuda', 'Belgium', 'Bahrain', 'Brazil',
'Central African Republic', 'Cote d'Ivoire', 'Cape Verde', 'Germany',
'Ecuador', 'Finland', 'Georgia', 'Guatemala', 'Indonesia', 'Iceland',
'Japan', 'South Korea', 'Libya', 'Morocco', 'Macedonia, FYR',
'Mozambique', 'Namibia', 'Norway', 'Panama', 'Portugal', 'Rwanda',
'El Salvador', 'Slovenia', 'Chad', 'Trinidad and Tobago', 'Ukraine',
'Vietnam', 'Zambia'],
dtype='object', name='name')
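The slice `[1::5]` is a simple form of systematic sampling: start at position 1 and keep every fifth label after it. A sketch on placeholder labels:

```python
import pandas as pd

countries = pd.Index([f'country_{i}' for i in range(12)], name='name')

# Start at position 1 and take every 5th label thereafter.
sample = countries[1::5]
print(sample.tolist())
```

Because the index here is ordered by country code rather than by any variable of interest, a fixed stride should behave much like a random draw, though that is an assumption rather than a guarantee.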
sample = pd.Series(sample)
sdi_latest_sample = pd.merge(sdi_latest, sample, how='right', on='name')
sdi_latest_sample.head()
| | name | SDI(%) | GNIpc (PPP$2017) |
|---|---|---|---|
| 0 | Angola | 62.6 | 5544 |
| 1 | Antigua and Barbuda | 62.2 | 24463 |
| 2 | Belgium | 42.9 | 43517 |
| 3 | Bahrain | 48.8 | 41966 |
| 4 | Brazil | 75.4 | 14307 |
sdi_latest_sample.set_index('name', inplace=True)
sns.set(rc={"figure.figsize":(10, 7)})
sns.set_style('white')
fig = plt.figure()
# Arranging the countries by GNIpc allows for easier comparisons between highly
# developed countries on the one hand and everyone else on the other.
data = sdi_latest_sample.sort_values('GNIpc (PPP$2017)')
ax = sns.barplot(data=data, x=data.index, y='SDI(%)', palette="mako_r")
fig.suptitle('Sustainable Development', y=1.02, fontsize=20, fontweight='bold')
ax.set_title('A Comparison of Countries ', fontsize=18)
ax.set_ylabel('Sustainable Development Index (%)', fontsize=12)
ax.set_xlabel('Country')
ax.set_xticks(np.arange(len(data)))
ax.set_xticklabels(data.index, rotation=45, ha='right')
ax2 = ax.twiny()
ax2.set_xlabel('GNIpc (%ile)')
ax2.set_xticks(np.arange(5), ['min', '25', '50', '75', 'max'])
mean_sdi = data['SDI(%)'].mean()
plt.axhline(mean_sdi, color='blueviolet', linestyle=':', label='mean SDI')
plt.legend(loc=(.79, .89));
It seems that highly developed countries do tend to struggle with sustainable development. This is exemplified by Norway, which has the highest GNIpc and the lowest SDI. Most of the other highly developed countries do not fare much better. Those that do are nonetheless well below the mean.
# Incorporating quartiles into the data itself allows for further, programmatic
# analysis.
quartiles = ['bottom', 'lower-middle', 'upper-middle', 'top']
data['GNIpc quartiles'] = pd.qcut(data['GNIpc (PPP$2017)'], 4, labels=quartiles)
# The bottom of the SDI barrel
sdi_q1 = data['SDI(%)'].quantile(0.25)
data.query('`SDI(%)` <= @sdi_q1').sort_values('SDI(%)', ascending=False)
| name | SDI(%) | GNIpc (PPP$2017) | GNIpc quartiles |
|---|---|---|---|
| Central African Republic | 42.8 | 794 | bottom |
| Chad | 42.8 | 1743 | bottom |
| Slovenia | 42.5 | 33676 | upper-middle |
| Germany | 39.3 | 46173 | top |
| Japan | 30.5 | 39739 | top |
| South Korea | 27.4 | 37343 | top |
| Iceland | 23.2 | 47861 | top |
| Finland | 22.3 | 42383 | top |
| Norway | 19.7 | 66308 | top |
Most of the countries in the top GNIpc quartile are at the bottom of the Sustainable Development Index. They are joined there by Slovenia, a borderline highly developed country.
The bottom quartile of the SDI is rounded out by Chad and the Central African Republic, which sit on the lower hinge itself (25ᵗʰ percentile). Even the least developed countries hold up better than the most developed ones in terms of SDI!
On the other hand, the countries in the middle two GNIpc quartiles are at the top of the SDI.
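The quartile labels used above come from `pd.qcut`, which bins values by rank rather than by fixed thresholds. A minimal sketch on made-up incomes (the letters are placeholder country labels):

```python
import pandas as pd

income = pd.Series([800, 1700, 5500, 12000, 14300, 23300, 39700, 66300],
                   index=list('ABCDEFGH'))
labels = ['bottom', 'lower-middle', 'upper-middle', 'top']

# qcut splits on quantiles, so each bin receives about the same number of rows.
quartile = pd.qcut(income, 4, labels=labels)
print(quartile.value_counts().to_dict())
```

With eight distinct values, each label is assigned to exactly two of them.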
# The cream of the SDI crop—sorted by GNIpc
sdi_q3 = data['SDI(%)'].quantile(0.75)
data.query('`SDI(%)` > @sdi_q3')
| name | SDI(%) | GNIpc (PPP$2017) | GNIpc quartiles |
|---|---|---|---|
| Ukraine | 75.8 | 8480 | lower-middle |
| Ecuador | 78.3 | 10215 | lower-middle |
| Georgia | 83.9 | 10671 | lower-middle |
| Indonesia | 77.1 | 12061 | lower-middle |
| Brazil | 75.4 | 14307 | upper-middle |
| Libya | 75.7 | 14751 | upper-middle |
| Panama | 82.1 | 23315 | upper-middle |
| Trinidad and Tobago | 74.2 | 28522 | upper-middle |
At the top of the SDI, Panama and Trinidad & Tobago stand out for their relative affluence. But they are not on the same level as the most developed countries.
data['GNIpc (PPP$2017)'].describe()
count       33.000000
mean     19829.090909
std      17411.165500
min        794.000000
25%       6970.000000
50%      12061.000000
75%      33676.000000
max      66308.000000
Name: GNIpc (PPP$2017), dtype: float64
In fact, Panama's GNIpc is not far above the mean, despite being almost twice the median. Such a large gap between the mean and the median indicates a pronounced right skew.
data['GNIpc (PPP$2017)'].skew()
0.9013158753711067
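The link between a mean that sits well above the median and a positive skew statistic can be seen on a small, made-up series with a single large value in its right tail:

```python
import pandas as pd

# Toy incomes: mostly small values plus one large one on the right.
incomes = pd.Series([1, 2, 2, 3, 3, 4, 5, 20])

print(incomes.mean())     # pulled upward by the large value
print(incomes.median())   # unaffected by how large the largest value is
print(incomes.skew())     # positive for a right-skewed sample
```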
# Plotly interactive plots are great for exploring data, so they will be used
# from here on. But their default title layout could be better.
pd.options.plotting.backend = "plotly"
def reconfig_title():
    'Reconfigure title layout of Plotly graph.'
    # Note: operates on the module-level `fig` created just before each call.
    title = '<b>' + fig.layout.title.text + '</b>'
    fig.update_layout(title={'x':0.5,
                             'xanchor':'center',
                             'y':0.85,
                             'font_size':18,
                             'text':title,
                             })
# Since histograms have no problem fitting large amounts of data—the more the
# better—the full list of countries can be used.
fig = sdi_latest.plot.hist(x='GNIpc (PPP$2017)', histnorm='percent', marginal='box',
hover_name=sdi_latest.index, title='International'
' Distribution of Income per Person')
reconfig_title()
fig.show()
The three leftmost bins contain more than three quarters of all countries, including high-SDI standouts Panama and Trinidad & Tobago.
The top quartile stretches from 29k to 113k international dollars (66k sans outliers). Per the country comparison above, these are the highly developed countries that struggle with sustainable development. To increase their SDIs and stave off resource depletion, these countries need to slow down their rate of consumption.
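The "sans outliers" figure corresponds to the top whisker of the marginal box plot. Assuming the conventional 1.5 × IQR rule (quartile conventions differ slightly between libraries, so the numbers may not match a given plot exactly), the whisker and the outliers can be computed as follows on toy values:

```python
import pandas as pd

gni = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # toy values, in thousands

q1, q3 = gni.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# The whisker ends at the largest observation that is still inside the fence.
whisker_top = gni[gni <= upper_fence].max()
outliers = gni[gni > upper_fence]
print(whisker_top, outliers.tolist())
```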
How can a 29k country and a 113k country both be on the same tier of development?
Human development is an asymptotic function of GNIpc. In other words, there are limits to human development that no amount of money can surpass. For example, literacy rate cannot exceed 100%. Similarly, life expectancy everywhere is limited by senescence (aging).
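A capped indicator of this kind can be pictured with any saturating curve. The exponential form and the `scale` parameter below are purely hypothetical choices made for illustration; they are not how the HDI or SDI treat income.

```python
import math


def capped_indicator(income: float, scale: float = 10_000.0) -> float:
    """Hypothetical saturating curve: rises with income but never exceeds 1."""
    return 1.0 - math.exp(-income / scale)


# Incomes roughly spanning the top quartile discussed above.
low, mid, high = (capped_indicator(x) for x in (29_000, 66_000, 113_000))
print(round(low, 3), round(mid, 3), round(high, 3))
```

Under this assumed curve, moving from 66k to 113k adds far less than moving from 29k to 66k, which is how countries with very different incomes can sit on nearly the same tier.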
However, it may not be bearable to people in, say, Iceland to slow down their rate of consumption to the level of people in Panama, or even of people in Bahrain, for that matter. Icelanders are simply accustomed to a more luxurious standard of living than Panamanians and Bahrainis, even though Bahrain is a highly developed country. But perhaps there is a country as rich as Iceland yet more sustainable*, which can serve as a model for them.
*Or less unsustainable, at least.
As a measure of sustainability, MFpc will suffice. In addition to being a measure of consumption, MFpc is indicative of overall environmental impact in most cases.
# The MFpc data only goes up to 2017.
mfpc_latest = mfpc.query('Year == 2017')['MFpc (tons)']
sdi_latest = pd.merge(sdi_latest, mfpc_latest, how='left', on='name')
sdi_latest.head()
| name | SDI(%) | GNIpc (PPP$2017) | MFpc (tons) |
|---|---|---|---|
| Afghanistan | 55.1 | 1763 | 1.20 |
| Angola | 62.6 | 5544 | 3.34 |
| Albania | 82.6 | 12694 | 11.57 |
| United Arab Emirates | 11.0 | 65650 | 49.11 |
| Argentina | 77.7 | 17529 | 14.78 |
# Note: A low MFpc suggests high sustainability.
x = sdi_latest['GNIpc (PPP$2017)']
y = sdi_latest['MFpc (tons)']
# A trendline can help visually separate the relatively lower MFpc countries
# all along the income scale—the better to find outliers with.
fig = sdi_latest.plot.scatter(y=y, x=x, trendline='ols', title='Material Foot'
'print vs Income', hover_name=sdi_latest.index)
reconfig_title()
# Likewise, a vertical line at the 3rd quartile can cordon off the high income
# countries. The area that they demarcate can be highlighted to further draw
# attention to it.
z = np.polyfit(x, y, 1)
f = np.poly1d(z)
gnipc_q3 = x.quantile(0.75)
gnipc_max = x.max()
fig.add_trace(go.Scatter(x=[gnipc_q3, gnipc_q3, gnipc_max, gnipc_max, gnipc_q3],
                         y=[0, f(gnipc_q3), f(gnipc_max), 0, 0],
                         mode='lines', fill='toself', fillcolor='green',
                         opacity=.2, hoverinfo='skip', showlegend=False))
fig.add_trace(go.Scatter(x=[gnipc_q3, gnipc_q3, gnipc_q3], y=[-3, 83, -3],
                         mode='lines', hoverinfo='skip', name='Q3'))
The following high-income countries stand out in terms of low material footprint, at least in relation to their income: Oman, Bahrain, Saudi Arabia, Brunei, Ireland, and Qatar.
sdi_latest.loc[['Oman', 'Bahrain', 'Saudi Arabia', 'Brunei', 'Ireland', 'Qatar']]
| name | SDI(%) | GNIpc (PPP$2017) | MFpc (tons) |
|---|---|---|---|
| Oman | 63.1 | 35758 | 10.34 |
| Bahrain | 48.8 | 41966 | 14.37 |
| Saudi Arabia | 45.6 | 48115 | 12.33 |
| Brunei | 33.8 | 72376 | 20.18 |
| Ireland | 42.4 | 72413 | 21.50 |
| Qatar | 26.0 | 113331 | 12.82 |
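The shaded region in the scatter plot can also be expressed as a filter: high income (at or above the third quartile) and a footprint below the fitted line. The sketch below uses made-up values and shortened column names, so it illustrates the idea without reproducing the actual selection.

```python
import numpy as np
import pandas as pd

# Toy data; rows with missing values are dropped because polyfit cannot handle NaN.
toy = pd.DataFrame({'GNIpc': [2_000, 10_000, 20_000, 40_000, 60_000, 70_000],
                    'MFpc': [2.0, 8.0, 15.0, 12.0, 35.0, 20.0]},
                   index=list('ABCDEF')).dropna()

# Straight-line fit of footprint on income.
slope, intercept = np.polyfit(toy['GNIpc'], toy['MFpc'], 1)
expected = slope * toy['GNIpc'] + intercept

high_income = toy['GNIpc'] >= toy['GNIpc'].quantile(0.75)
below_trend = toy['MFpc'] < expected
standouts = toy[high_income & below_trend]
print(standouts)
```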
sdi_latest.mean()
SDI(%)                 57.280982
GNIpc (PPP$2017)    19105.883436
MFpc (tons)            13.227055
dtype: float64
Yet most of these countries have SDIs below the overall mean of about 57%.
Their low SDIs must be on account of unsustainability; it is certainly not for lack of development! Likewise, their unsustainability must be due to high carbon-dioxide (CO₂) emissions. Recall that the SDI accounts for both MFpc and per capita CO₂. Since material footprint is not to blame, carbon footprint is the culprit.
That makes sense. Bahrain, Saudi Arabia, Qatar, and Brunei are all petrostates. The petroleum industry is the poster child of carbon-dioxide emissions, but because petrostates export most of their petroleum products, their material footprints are largely unaffected. The exports are counted as part of the material footprints of the importing countries.
And Then There Were Two
Of the remaining two, Oman and Ireland, one is a crude-oil petrostate with a deceptively small carbon footprint.* And the other is a corporate tax haven with an inflated GNI.
*Oman largely punts the refining process to other countries, which is where most of the CO₂ emissions occur.
*At least not without offloading the environmental costs or performing accounting sleight of hand (see insert above).
The standard for success must not be one of luxury but rather one of socioeconomic agency and financial security. These, along with health and education, will be the lodestones of human progress towards a sustainable future.